In this EDA, I explore a dataset of airline on-time performance to try and find insights to flight delay and cancellations. The dataset used is a very large dataset that consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. There are over 120 million observations (flights) in this dataset for flights. The data was compressed into individual CSV files for each year.
I chose to explore this particular dataset because it would allow me learn new skills and optimization techniques for handling large datasets.
Due to the size of this dataset, it would very difficult to load the data into a Pandas dataframe in memory without reducing it to a very small subset of the data, so I decided to employ the use of R markdown (instead of Jupyter notebook) so that I can use R packages, along with SQL queries, to wrangle the data into a more summarized format that a Pandas dataframe can handle.
library(tidyverse)
library(skimr)
library(dplyr)
library(here) # To locate files based on current working directory
library(janitor) # Tools for for examining and cleaning dirty data.
library(reticulate) # For reading R objects in Python
library(data.table) # For reading large datasets efficiently
library(inborutils) # For reading CSV files and converting to SQL
library(DBI) # Interface to connect with SQL databases
library(RSQLite) # For connecting with SQL databasesimport pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pathlib
import osThis dataset contains 21 large CSV files of flight data for each year from 1987 to 2008, as well as some other CSV files for which contain extra information. I will not directly read any of the large CSV files because it would take too much memory. I will read in the other CSV files, then use R libraries and SQL queries to read in smaller samples of the data to explore.
path_column_data = "airline/airline_dataset_column_info.csv"
path_airport_data = "airline/other_data/airports.csv"
path_carrier_data = "airline/other_data/carriers.csv"
path_plane_data = "airline/other_data/plane-data.csv"
path_main = "airline/main_data" # Path to the main (yearly) CSV files
def absolute_file_paths(directory):
data = {}
files = os.listdir(directory)
paths = [f"{path_main}/{file}" for file in files]
for idx in range(len(files)):
name = files[idx].split('.')[0]
data[name] = paths[idx]
return data
main_files = absolute_file_paths(path_main)# Data on column descriptions for the main files
column_data = pd.read_csv(path_column_data)
# Data on different airports
airport_data = pd.read_csv(path_airport_data)
# Airline companies
carrier_data = pd.read_csv(path_carrier_data)
# Plane data, specifications and other info
plane_data = pd.read_csv(path_plane_data)# Info on column descriptions for the main files
column_data <- read.csv(py$path_column_data)
# Data on different airports
airport_data <- read.csv(py$path_airport_data)
# Information on airline companies
carrier_data <- read.csv(py$path_carrier_data)
# Plane data, specifications and other info
plane_data <- read.csv(py$path_plane_data)For the main data, I have written a script that reads in the data in smaller chunks and stores them in a database file (sqlite). Each year’s data is stored in its own table. I also store the other data in their own tables so that later, when needed, I can reference them using SQL joins. The process takes a while to run because of the large dataset (over 30 minutes on my PC).
Also, the original files are named by the year they represent. It is not be good practice to name a database table starting with a number, so the script adds a prefix to each name.
For now, I have added a condition so the code will only be executed if the sqlite file is not detected in the project root directory.
path_main <- "airline/main_data"
db_file <- "airline_data.sqlite"
save_in_sql <- function() {
main_files <- list.files(path = path_main, full.names = TRUE)
if (!file.exists(db_file)) {
# Creating the airport data table
inborutils::csv_to_sqlite(
csv_file = py$path_airport_data,
table_name = "airports",
sqlite_file = db_file,
show_progress_bar = FALSE)
# Creating the carrier data table
inborutils::csv_to_sqlite(
csv_file = py$path_carrier_data,
table_name = "carriers",
sqlite_file = db_file,
show_progress_bar = FALSE)
# Creating the plane data table
inborutils::csv_to_sqlite(
csv_file = py$path_plane_data,
table_name = "planes",
sqlite_file = db_file,
show_progress_bar = FALSE)
# Creating the tables for each of the years' data
for (csv in main_files) {
csv_name <- strsplit(csv, "/|[.]")[[1]] # Splitting the csv name by "/" or "."
csv_name <- csv_name[length(csv_name)-1] # Getting the second last element of the list
table_name <- paste("table", csv_name, sep="_")
print("Updating table: %s", table_name)
inborutils::csv_to_sqlite(
csv_file = csv,
sqlite_file = db_file,
table_name = table_name,
pre_process_size = 1000,
chunk_size = 50000,
show_progress_bar = TRUE)
}
}
}
save_in_sql()Now I inspect the database file to be sure that all tables have been added and updated properly
airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
db_tables <- dbListTables(airline_db) # List out the tables in the db
print(db_tables)## [1] "airports" "carriers" "planes" "sqlite_stat1" "sqlite_stat4"
## [6] "table_1987" "table_1988" "table_1989" "table_1990" "table_1991"
## [11] "table_1992" "table_1993" "table_1994" "table_1995" "table_1996"
## [16] "table_1997" "table_1998" "table_1999" "table_2000" "table_2001"
## [21] "table_2002" "table_2003" "table_2004" "table_2005" "table_2006"
## [26] "table_2007" "table_2008"
db_1993_cols <- dbListFields(airline_db, "table_1993") # Column names for specific table in db
print(length(db_1993_cols))## [1] 29
We can see from the above result that there are 29 columns in the table and this is the same across all the tables (the yearly tables), they all have the same columns, but we don’t know exactly how many rows are in each table.
The code below is a script/query to return exactly the number of rows (observations) that are in each table. The query can take a few minutes to execute the first time.
count_rows <- function() {
Table = character() # Empty vector/list to store table names
Row_Count = integer() # Empty vector/list to store row counts
for (table in db_tables) {
query_rows <- sprintf("SELECT COUNT(*) AS Rows FROM %s", table)
row_count <- dbGetQuery(airline_db, query_rows)[[1]]
Table <- c(Table, table) # Appending each table name to the vector
Row_Count <- c(Row_Count, row_count) # Appending each row count to the vector
}
df_row_count <- data.frame(Table, Row_Count)
return(df_row_count)
}
table_row_count <- count_rows()
table_row_count## Table Row_Count
## 1 airports 3376
## 2 carriers 1491
## 3 planes 5029
## 4 sqlite_stat1 25
## 5 sqlite_stat4 0
## 6 table_1987 1311826
## 7 table_1988 5202096
## 8 table_1989 5041200
## 9 table_1990 5270893
## 10 table_1991 5076925
## 11 table_1992 5092157
## 12 table_1993 5070501
## 13 table_1994 5180048
## 14 table_1995 5327435
## 15 table_1996 5351983
## 16 table_1997 5411843
## 17 table_1998 5384721
## 18 table_1999 5527884
## 19 table_2000 5683047
## 20 table_2001 5967780
## 21 table_2002 5271359
## 22 table_2003 6488540
## 23 table_2004 7129270
## 24 table_2005 7140596
## 25 table_2006 7141922
## 26 table_2007 7453215
## 27 table_2008 2389217
We can now see the number of rows in each table, which sums up to over 120 million observations. To test the SQL connection, I load in the first 500 rows of data from a particular year (2005 dataset in this case) using SQL and the R interface
query_test <- "SELECT * FROM table_2003 LIMIT 10"
tbl(airline_db, sql(query_test)) # Runs the query and displays results without loading it in memory## # Source: SQL [10 x 29]
## # Database: sqlite 3.39.2 [O:\GitHub\data_analysis\alx\airline_performance_analysis\airline_data.sqlite]
## Year Month DayofMo…¹ DayOf…² DepTime CRSDe…³ ArrTime CRSAr…⁴ Uniqu…⁵ Fligh…⁶
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 2003 1 29 3 1651 1655 1912 1913 UA 1017
## 2 2003 1 30 4 1654 1655 1910 1913 UA 1017
## 3 2003 1 31 5 1724 1655 1936 1913 UA 1017
## 4 2003 1 1 3 1033 1035 1625 1634 UA 1018
## 5 2003 1 2 4 1053 1035 1726 1634 UA 1018
## 6 2003 1 3 5 1031 1035 1640 1634 UA 1018
## 7 2003 1 4 6 1031 1035 1626 1634 UA 1018
## 8 2003 1 5 7 1035 1035 1636 1634 UA 1018
## 9 2003 1 6 1 1031 1035 1653 1634 UA 1018
## 10 2003 1 1 3 1713 1710 1851 1847 UA 1020
## # … with 19 more variables: TailNum <chr>, ActualElapsedTime <dbl>,
## # CRSElapsedTime <dbl>, AirTime <dbl>, ArrDelay <dbl>, DepDelay <dbl>,
## # Origin <chr>, Dest <chr>, Distance <dbl>, TaxiIn <dbl>, TaxiOut <dbl>,
## # Cancelled <dbl>, CancellationCode <int>, Diverted <dbl>,
## # CarrierDelay <int>, WeatherDelay <int>, NASDelay <int>,
## # SecurityDelay <int>, LateAircraftDelay <int>, and abbreviated variable
## # names ¹DayofMonth, ²DayOfWeek, ³CRSDepTime, ⁴CRSArrTime, ⁵UniqueCarrier, …
top_rows <- dbGetQuery(airline_db, query_test) # Runs the query and stores it in a dataframe, in memory♂
dbDisconnect(airline_db) # Disconnect from the database when doneNow lets look at some summary statistics for the data. I will read in the first 5,000 rows of the 2007 flight data.
top_2007 = pd.read_csv(main_files['2007'], nrows=5000)
top_2007.info()## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 5000 entries, 0 to 4999
## Data columns (total 29 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Year 5000 non-null int64
## 1 Month 5000 non-null int64
## 2 DayofMonth 5000 non-null int64
## 3 DayOfWeek 5000 non-null int64
## 4 DepTime 4964 non-null float64
## 5 CRSDepTime 5000 non-null int64
## 6 ArrTime 4963 non-null float64
## 7 CRSArrTime 5000 non-null int64
## 8 UniqueCarrier 5000 non-null object
## 9 FlightNum 5000 non-null int64
## 10 TailNum 5000 non-null object
## 11 ActualElapsedTime 4963 non-null float64
## 12 CRSElapsedTime 5000 non-null int64
## 13 AirTime 4963 non-null float64
## 14 ArrDelay 4963 non-null float64
## 15 DepDelay 4964 non-null float64
## 16 Origin 5000 non-null object
## 17 Dest 5000 non-null object
## 18 Distance 5000 non-null int64
## 19 TaxiIn 5000 non-null int64
## 20 TaxiOut 5000 non-null int64
## 21 Cancelled 5000 non-null int64
## 22 CancellationCode 36 non-null object
## 23 Diverted 5000 non-null int64
## 24 CarrierDelay 5000 non-null int64
## 25 WeatherDelay 5000 non-null int64
## 26 NASDelay 5000 non-null int64
## 27 SecurityDelay 5000 non-null int64
## 28 LateAircraftDelay 5000 non-null int64
## dtypes: float64(6), int64(18), object(5)
## memory usage: 1.1+ MB
Most of the columns are numeric, some indicating arrival and departure, as well as different causes of delays. There are some binary columns such as “Cancelled” and “Diverted” which are important variables to analyze.
For this EDA, I am interested in exploring some of the ideas suggested on the source website which are:
Generally, I am interested in exploring the cause of flight delays and cancellations.
For this investigation, I will get the best insights by focusing on the “delay” columns. By analyzing the delays on each day of the week and each month, I believe I can get a good idea of the best times to fly. I will explore data for a single year. Then later on, I will compare the data across the other years to see if there are similar patterns across the years.
I will start by analyzing the reason for flight cancellations:
CancellationCode column. I will be using the 2007 dataset.
Because the data is so large, I will only read in some select columns
into the dataframe.
get_cancelled_flights <- function(year){
airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
s <- sprintf("CASE
WHEN DayOfWeek = 1 Then 'Monday'
WHEN DayOfWeek = 2 Then 'Tuesday'
WHEN DayOfWeek = 3 Then 'Wednesday'
When DayOfWeek = 4 Then 'Thursday'
When DayOfWeek = 5 Then 'Friday'
When DayOfWeek = 6 Then 'Saturday'
When DayOfWeek = 7 Then 'Sunday'
END AS Day")
query_cc <- sprintf("SELECT Month, DayofMonth, DayOfWeek, %s, CancellationCode, COUNT(*) AS Flights
FROM table_%s
WHERE Cancelled = 1
GROUP BY Month, DayofMonth, DayOfWeek, Day, CancellationCode", s, year)
cancelled_flights <- dbGetQuery(airline_db, query_cc) # Runs the query and stores it in a dataframe, in memory♂
dbDisconnect(airline_db) # Disconnect from the database when done
return(cancelled_flights)
}
cancelled_flights_2007 = get_cancelled_flights('2007')
head(cancelled_flights_2007)## Month DayofMonth DayOfWeek Day CancellationCode Flights
## 1 1 1 1 Monday A 268
## 2 1 1 1 Monday B 56
## 3 1 1 1 Monday C 35
## 4 1 2 2 Tuesday A 167
## 5 1 2 2 Tuesday B 21
## 6 1 2 2 Tuesday C 23
# Converting R dataframe to Pandas dataframe
cancelled_flights_2007 = pd.DataFrame(r.cancelled_flights_2007)
cancelled_flights_2007.info()
# Changing numeric data types to integer## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1095 entries, 0 to 1094
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Month 1095 non-null float64
## 1 DayofMonth 1095 non-null float64
## 2 DayOfWeek 1095 non-null float64
## 3 Day 1095 non-null object
## 4 CancellationCode 1095 non-null object
## 5 Flights 1095 non-null int32
## dtypes: float64(3), int32(1), object(2)
## memory usage: 47.2+ KB
cancelled_flights_2007 = cancelled_flights_2007.astype({'Month':'int8', 'DayofMonth':'int8', 'DayOfWeek':'int8'})
cancelled_flights_2007.info()## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1095 entries, 0 to 1094
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Month 1095 non-null int8
## 1 DayofMonth 1095 non-null int8
## 2 DayOfWeek 1095 non-null int8
## 3 Day 1095 non-null object
## 4 CancellationCode 1095 non-null object
## 5 Flights 1095 non-null int32
## dtypes: int32(1), int8(3), object(2)
## memory usage: 24.7+ KB
labels = ['Carrier', 'Weather', 'NAS', 'Security']
sns.set_theme(style="darkgrid", palette=None, font_scale=1.5)
base_color = sns.color_palette("husl", 9)
def pie_cancel():
plt.figure(figsize=[16, 10])
cg = cancelled_flights_2007.groupby("CancellationCode").sum()["Flights"]
plt.pie(x=cg, labels=labels, startangle=90, counterclock=False, autopct='%.0f%%');
plt.axis("square")
plt.show();
pie_cancel()We can see from the chart that most of the cancelled flights in 2007 were as a result of carrier delays, followed closely by weather delays. Lets see if its the same trend across other years. I will be working with a 4 year period (2005 to 2008).
cancelled_flights_2005 = get_cancelled_flights('2005')
cancelled_flights_2006 = get_cancelled_flights('2006')
cancelled_flights_2007 = get_cancelled_flights('2007')
cancelled_flights_2008 = get_cancelled_flights('2008')cancelled_flights_2005 = pd.DataFrame(r.cancelled_flights_2005)
cancelled_flights_2006 = pd.DataFrame(r.cancelled_flights_2006)
cancelled_flights_2007 = pd.DataFrame(r.cancelled_flights_2007)
cancelled_flights_2008 = pd.DataFrame(r.cancelled_flights_2008)cancelled_list = [cancelled_flights_2005, cancelled_flights_2006, cancelled_flights_2007, cancelled_flights_2008]
cancelled_list_str = ['cancelled_flights_2005', 'cancelled_flights_2006', 'cancelled_flights_2007', 'cancelled_flights_2008']
def save_to_csv(df_list):
for idx, df in enumerate(df_list):
os.makedirs('airline/saved_data/', exist_ok=True)
df.to_csv(f'airline/saved_data/{cancelled_list_str[idx]}.csv', index=False)
# save_to_csv(cancelled_list)def pie_multi():
plt.figure(figsize=[16, 10])
plt.subplot(2, 2, 1)
cg3 = cancelled_flights_2005.groupby("CancellationCode").sum()["Flights"]
plt.pie(x=cg3, labels=labels, startangle=90, counterclock=False, autopct='%.0f%%');
plt.subplot(2, 2, 2)
cg4 = cancelled_flights_2006.groupby("CancellationCode").sum()["Flights"]
plt.pie(x=cg4, labels=labels, startangle=90, counterclock=False, autopct='%.0f%%');
plt.subplot(2, 2, 3)
cg5 = cancelled_flights_2007.groupby("CancellationCode").sum()["Flights"]
plt.pie(x=cg5, labels=labels, startangle=90, counterclock=False, autopct='%.0f%%');
plt.subplot(2, 2, 4)
cg6 = cancelled_flights_2008.groupby("CancellationCode").sum()["Flights"]
plt.pie(x=cg6, labels=labels, startangle=90, counterclock=False, autopct='%.0f%%');
plt.show()
pie_multi()We can see that carrier is consistently a major reason
for cancelled flights followed by weather.
Here I try to see the delay times to see the most common delay duration.
df_2007 = pd.read_csv(main_files['2007'], usecols = ['Month', 'DayofMonth', 'ArrDelay', 'DepDelay', 'Distance'], nrows = 2_000_000)
df_2007.info()## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2000000 entries, 0 to 1999999
## Data columns (total 5 columns):
## # Column Dtype
## --- ------ -----
## 0 Month int64
## 1 DayofMonth int64
## 2 ArrDelay float64
## 3 DepDelay float64
## 4 Distance int64
## dtypes: float64(2), int64(3)
## memory usage: 76.3 MB
df_2007.describe()
# Check for missing values## Month DayofMonth ArrDelay DepDelay Distance
## count 2.000000e+06 2.000000e+06 1.935147e+06 1.939160e+06 2.000000e+06
## mean 2.182453e+00 1.555414e+01 1.050200e+01 1.184517e+01 7.000558e+02
## std 9.717774e-01 8.695236e+00 3.906322e+01 3.582602e+01 5.458290e+02
## min 1.000000e+00 1.000000e+00 -3.120000e+02 -3.050000e+02 1.100000e+01
## 25% 1.000000e+00 8.000000e+00 -9.000000e+00 -4.000000e+00 3.080000e+02
## 50% 2.000000e+00 1.600000e+01 0.000000e+00 0.000000e+00 5.460000e+02
## 75% 3.000000e+00 2.300000e+01 1.500000e+01 1.200000e+01 9.300000e+02
## max 4.000000e+00 3.100000e+01 1.564000e+03 1.547000e+03 4.962000e+03
df_2007['DepDelay'].isnull().sum()## 60840
df_2007[df_2007.ArrDelay.notnull()]## Month DayofMonth ArrDelay DepDelay Distance
## 0 1 1 1.0 7.0 389
## 1 1 1 8.0 13.0 479
## 2 1 1 34.0 36.0 479
## 3 1 1 26.0 30.0 479
## 4 1 1 -3.0 1.0 479
## ... ... ... ... ... ...
## 1999995 4 30 22.0 0.0 741
## 1999996 4 30 48.0 50.0 1024
## 1999997 4 30 -12.0 -1.0 646
## 1999998 4 30 12.0 20.0 589
## 1999999 4 30 -21.0 -5.0 461
##
## [1935147 rows x 5 columns]
# Changing column data types to reduce memory usage
df_2007 = df_2007.astype({'Month':'int8', 'DayofMonth':'int8', 'ArrDelay':'float32', 'DepDelay':'float32'})def delay_hist():
binsize = 30
bins = np.arange(0, df_2007['DepDelay'].max()+binsize, binsize)
plt.figure(figsize=[16, 10]);
plt.hist(data = df_2007, x = 'DepDelay', bins = bins);
plt.xlabel('Departure Delays (mins)');
plt.show()
delay_hist()The distribution is skewed to the left and there is a short tail. A large majority of the data falls within the range of 0 and 250 minutes. I would have gone for a logarithmic scale but this data has negative values (because there are flights that took off before the expected departure time). There are also some missing values.
# Flights that took off over 25 minutes earlier
df_2007[df_2007.DepDelay < -25].info()
# Lowest departure delay## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 165 entries, 99831 to 1977687
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Month 165 non-null int8
## 1 DayofMonth 165 non-null int8
## 2 ArrDelay 163 non-null float32
## 3 DepDelay 165 non-null float32
## 4 Distance 165 non-null int64
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 4.2 KB
df_2007[df_2007.DepDelay < 0].DepDelay.min()## -305.0
We can see that there are many flights that took off before the expected departure time (almost 50% of all the flights in that year). That is not unusual, especially if it falls within a few minutes and all passengers are available, but there are many flights that took off unusually early (over 30 minutes early, even up to 5 hours early). There can be many reasons for this but for now since this exploration is mainly focused on delay times and there are so many records to work with, I will only assess flights that were actually delayed,
df_2007_delayed = df_2007[df_2007.DepDelay > 0]
df_2007_delayed.info()## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 867613 entries, 0 to 1999998
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Month 867613 non-null int8
## 1 DayofMonth 867613 non-null int8
## 2 ArrDelay 865260 non-null float32
## 3 DepDelay 867613 non-null float32
## 4 Distance 867613 non-null int64
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 21.5 MB
df_2007_delayed.describe()## Month DayofMonth ... DepDelay Distance
## count 867613.000000 867613.000000 ... 867613.000000 867613.000000
## mean 2.199465 15.268106 ... 31.058889 732.023085
## std 0.961845 8.696432 ... 46.776241 554.655358
## min 1.000000 1.000000 ... 1.000000 11.000000
## 25% 1.000000 8.000000 ... 5.000000 328.000000
## 50% 2.000000 15.000000 ... 15.000000 587.000000
## 75% 3.000000 22.000000 ... 38.000000 967.000000
## max 4.000000 31.000000 ... 1547.000000 4962.000000
##
## [8 rows x 5 columns]
Now I will try to plot using a log scale
def log_hist():
log_binsize = 0.10
bins = 10 ** np.arange(0, np.log10(df_2007_delayed['DepDelay'].max())+log_binsize, log_binsize)
plt.figure(figsize=[16, 10])
plt.hist(data = df_2007_delayed, x = 'DepDelay', bins = bins)
plt.xscale('log')
# plt.xticks([500, 1e3, 2e3, 5e3, 1e4, 2e4], [50, '100', '200', '400', '800', '1600'])
plt.xlabel('Departure Delays (mins)')
plt.show()
log_hist()I will look at the departure delays and arrival delays. Specifically, I will focus on the arrival delay and its relationship with distance. I want to see if longer flights tend to have longer delays.
df_2007_sampled = df_2007.dropna(subset=['ArrDelay', 'Distance']).sample(n=20000, replace = False)
def dist_vs_arr():
# Dropping rows with missing values
print("Original rows and columns =",df_2007.shape)
print("Sampled rows and columns =",df_2007_sampled.shape)
print(df_2007_sampled.info())
print(df_2007_sampled.head())
plt.figure(figsize=[12, 12])
plt.scatter(data=df_2007_sampled, x='Distance', y='ArrDelay')
plt.xlabel("Distance in miles")
plt.ylabel("Arrival Delay in minutes")
plt.show()
dist_vs_arr()## Original rows and columns = (2000000, 5)
## Sampled rows and columns = (20000, 5)
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 20000 entries, 1917893 to 1931736
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Month 20000 non-null int8
## 1 DayofMonth 20000 non-null int8
## 2 ArrDelay 20000 non-null float32
## 3 DepDelay 20000 non-null float32
## 4 Distance 20000 non-null int64
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 507.8 KB
## None
## Month DayofMonth ArrDelay DepDelay Distance
## 1917893 4 30 3.0 -2.0 236
## 769614 2 8 13.0 15.0 793
## 985675 2 3 109.0 89.0 1266
## 1699625 3 27 17.0 11.0 1345
## 863445 2 17 206.0 229.0 1515
There seems to be very little correlation from this plot. Since there are a lot of overlapping points, I will apply some transparency to get a better picture.
def dist_vs_arr_blur():
plt.figure(figsize=[12, 12])
sns.regplot(data=df_2007_sampled, x='Distance', y='ArrDelay', scatter_kws = {'alpha': 1/5}, fit_reg = False);
plt.xlabel("Distance in miles")
plt.ylabel("Arrival Delay in minutes")
plt.show()
dist_vs_arr_blur()Most of the flights had distances between the range of 2500 miles and delays within 400 minutes. Again I will be removing the negative values since I am only interested in flights that arrived later than expected.
df_2007_sampled = df_2007_sampled.query("0 < ArrDelay < 200")
def dist_vs_arr_focus():
plt.figure(figsize=[12, 12])
sns.regplot(data=df_2007_sampled, x='Distance', y='ArrDelay', scatter_kws = {'alpha': 1/5}, fit_reg = False);
plt.show();
dist_vs_arr_focus()Many of the flights with shorter distances seemed to have shorter delays since the density of the dots seem to fade as it gets away from the 0 mark. Lets try to get a different picture with another plot.
def heat_plot():
x_bins = np.arange(0, 4500+100, 100)
y_bins = np.arange(0, 200+7, 7)
plt.figure(figsize=[12, 12])
plt.hist2d(data=df_2007_sampled, x='Distance', y='ArrDelay', cmin=0.3, cmap='viridis_r', bins = [x_bins, y_bins]);
plt.show();
heat_plot()## <string>:6: MatplotlibDeprecationWarning: Auto-removal of grids by pcolor() and pcolormesh() is deprecated since 3.5 and will be removed two minor releases later; please call grid(False) first.
Its still the same story from what I can see here.
Next, I will analyze the delays on a day-of-week
basis.
get_mean_delay_year <- function(year) {
airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
# Query for mean delay times in January (actual delays, no early flights)
query_jan <- sprintf("SELECT DayOfWeek, DayOfMonth,
AVG(DepDelay) AS MeanDepDelay,
AVG(ArrDelay) AS MeanArrDelay
FROM table_%s
WHERE DepDelay > 0 AND ArrDelay > 0
GROUP BY DayOfWeek, DayOfMonth", year)
# tbl(airline_db, sql(query_jan)) # Runs the query and displays results without loading it in memory
delays_2007 <- dbGetQuery(airline_db, query_jan) # Runs the query and stores it in a dataframe, in memory♂
dbDisconnect(airline_db) # Disconnect from the database when done
return(delays_2007)
}delays_2007 <- get_mean_delay_year('2007')
head(delays_2007)## DayOfWeek DayofMonth MeanDepDelay MeanArrDelay
## 1 1 1 35.50271 35.54895
## 2 1 2 40.18925 41.42904
## 3 1 3 38.94795 39.51203
## 4 1 4 46.83710 49.46875
## 5 1 5 36.06113 37.46486
## 6 1 6 42.34605 44.32375
skim(delays_2007)| Name | delays_2007 |
| Number of rows | 216 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| DayOfWeek | 0 | 1 | 3.99 | 2.00 | 1.00 | 2.00 | 4.00 | 6.00 | 7.00 | ▇▃▃▃▇ |
| DayofMonth | 0 | 1 | 15.93 | 8.93 | 1.00 | 8.00 | 16.00 | 24.00 | 31.00 | ▇▇▇▇▇ |
| MeanDepDelay | 0 | 1 | 38.37 | 6.48 | 24.67 | 33.25 | 37.68 | 41.89 | 61.84 | ▃▇▆▁▁ |
| MeanArrDelay | 0 | 1 | 40.66 | 7.88 | 24.93 | 34.60 | 39.75 | 45.29 | 70.32 | ▅▇▅▁▁ |
For the DayOfWeek data, I will make another column in
the dataframe that shows the text representation (Monday, Tuesday …) so
that it would be easier to understand in the plot.
def change_column_type(df):
# Converting R dataframe to Pandas dataframe
df_delays = pd.DataFrame(df)
# Changing day and month columns from float to integer data types
df_delays = df_delays.astype({'DayOfWeek':'int8', 'DayofMonth':'int8'})
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Creating the new column
df_delays['Day'] = df_delays['DayOfWeek'].apply(lambda x: days_of_week[x-1])
return(df_delays)
delays_2007 = change_column_type(r.delays_2007)
delays_2007.head()## DayOfWeek DayofMonth MeanDepDelay MeanArrDelay Day
## 0 1 1 35.502709 35.548949 Monday
## 1 1 2 40.189248 41.429041 Monday
## 2 1 3 38.947953 39.512035 Monday
## 3 1 4 46.837098 49.468746 Monday
## 4 1 5 36.061127 37.464863 Monday
def plot_bar_err_single():
plt.figure(figsize=[16,12])
sns.barplot(data=delays_2007, x='Day', y='MeanDepDelay')
plt.title("Average Flight Delay Times In 2007")
plt.xlabel("Day of Week")
plt.ylabel("Average Departure Delay (min)")
plt.show();
plot_bar_err_single()From the above chart alone, the average delay times for each weekday seems to be fairly the same. I will see if there is a consistent pattern across the years by plotting the chart for 9 consecutive years (2000 - 2008)
I left the colors because, even though the weekday variable is ordinal (i.e. Tuesday comes after Monday and so on), the order doesn’t really matter much in this case because, for example, Friday is not better than Sunday, Monday is not higher than Saturday, etc. The colors will be helpful in identifying each weekday in the subsequent plots.
delays_2000 <- get_mean_delay_year('2000')
delays_2001 <- get_mean_delay_year('2001')
delays_2002 <- get_mean_delay_year('2002')
delays_2003 <- get_mean_delay_year('2003')
delays_2004 <- get_mean_delay_year('2004')
delays_2005 <- get_mean_delay_year('2005')
delays_2006 <- get_mean_delay_year('2006')
delays_2007 <- get_mean_delay_year('2007')
delays_2008 <- get_mean_delay_year('2008')delays_2000 = change_column_type(r.delays_2000)
delays_2001 = change_column_type(r.delays_2001)
delays_2002 = change_column_type(r.delays_2002)
delays_2003 = change_column_type(r.delays_2003)
delays_2004 = change_column_type(r.delays_2004)
delays_2005 = change_column_type(r.delays_2005)
delays_2006 = change_column_type(r.delays_2006)
delays_2007 = change_column_type(r.delays_2007)
delays_2008 = change_column_type(r.delays_2008)delays_list = [delays_2000, delays_2001, delays_2002, delays_2003, delays_2004, delays_2005, delays_2006, delays_2007, delays_2008]
delays_list_str = ['delays_2000', 'delays_2001', 'delays_2002', 'delays_2003', 'delays_2004', 'delays_2005', 'delays_2006', 'delays_2007', 'delays_2008']
def save_to_csv(df_list):
for idx, df in enumerate(df_list):
os.makedirs('airline/saved_data/', exist_ok=True)
df.to_csv(f'airline/saved_data/{delays_list_str[idx]}.csv', index=False)
# save_to_csv(delays_list)def bar_plot_err_multi():
fig, ax = plt.subplots(ncols = 3, nrows = 3 , figsize = [22,17])
sns.barplot(data=delays_2000, x='Day', y='MeanDepDelay', ax = ax[0, 0])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2001, x='Day', y='MeanDepDelay', ax = ax[1, 0])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2002, x='Day', y='MeanDepDelay', ax = ax[2, 0])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2003, x='Day', y='MeanDepDelay', ax = ax[0, 1])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2004, x='Day', y='MeanDepDelay', ax = ax[1, 1])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2005, x='Day', y='MeanDepDelay', ax = ax[2, 1])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2006, x='Day', y='MeanDepDelay', ax = ax[0, 2])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2007, x='Day', y='MeanDepDelay', ax = ax[1, 2])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2008, x='Day', y='MeanDepDelay', ax = ax[2, 2])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
plt.show();
bar_plot_err_multi() From an overview of the chart above, there is no consistent trend to predict which weekdays have more delays. Though if we look closely, it looks like Thursday generally has more delays but its not very obvious. Also, the minimum and maximum delay times are all within a small range of 30 to 40 minutes so there is very little variation. This is expected because there are many other factors to consider like the month, the season, holidays, airport carrier, plane age and global events.
Now I will examine the cancelled and diverted flights, relative to the carriers (airline companies). I want to see if flights from a carrier tend to get get cancelled or diverted more than others.
Below is an SQL query to get all flights that were either diverted or cancelled and group them by the flight carrier. On the original table, there is a column for diverted (1 or 0) and another column for cancelled (1 or 0). I believe the data is not completely tidy because a flight that is cancelled cannot be diverted and vice-versa. So I combined them to a single column indicating whether the flight was cancelled or diverted.
get_changed_flights <- function(year) {
airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
query <- sprintf("SELECT
UniqueCarrier,
Description AS Carrier,
CASE
WHEN Diverted = 1 AND Cancelled = 0 THEN 'Diverted'
WHEN Diverted = 0 AND Cancelled = 1 THEN 'Cancelled'
END AS FlightStatus,
COUNT(*) AS Flights
FROM table_%s
LEFT JOIN carriers
ON table_%s.UniqueCarrier = carriers.Code
WHERE Diverted = 1 OR Cancelled = 1
GROUP BY UniqueCarrier, Carrier, FlightStatus
ORDER BY UniqueCarrier", year, year)
flights_changed <- dbGetQuery(airline_db, query) # Runs the query and stores it in a dataframe, in memory♂
dbDisconnect(airline_db) # Disconnect from the database when done
return(flights_changed)
}
flights_changed_2007 <- get_changed_flights('2007')
head(flights_changed_2007, 10)## UniqueCarrier Carrier FlightStatus Flights
## 1 9E Pinnacle Airlines Inc. Cancelled 7939
## 2 9E Pinnacle Airlines Inc. Diverted 726
## 3 AA American Airlines Inc. Cancelled 17924
## 4 AA American Airlines Inc. Diverted 2097
## 5 AQ Aloha Airlines Inc. Cancelled 388
## 6 AQ Aloha Airlines Inc. Diverted 15
## 7 AS Alaska Airlines Inc. Cancelled 2563
## 8 AS Alaska Airlines Inc. Diverted 499
## 9 B6 JetBlue Airways Cancelled 3710
## 10 B6 JetBlue Airways Diverted 624
Lets see the number of cancelled and diverted flights in this data (the 2007 data).
flights_changed_2007 = pd.DataFrame(r.flights_changed_2007)
print(f"Total number of diverted or cancelled flights in 2007: {flights_changed_2007.Flights.sum()}")## Total number of diverted or cancelled flights in 2007: 177927
flights_changed_2007.head()## UniqueCarrier Carrier FlightStatus Flights
## 0 9E Pinnacle Airlines Inc. Cancelled 7939
## 1 9E Pinnacle Airlines Inc. Diverted 726
## 2 AA American Airlines Inc. Cancelled 17924
## 3 AA American Airlines Inc. Diverted 2097
## 4 AQ Aloha Airlines Inc. Cancelled 388
def plot_vertical_clustered():
plt.figure(figsize=[16,12])
sns.barplot(data = flights_changed_2007, x = 'UniqueCarrier', y = 'Flights', hue = 'FlightStatus')
plt.title("Cancelled & Diverted Flights From Each Carrier In 2007")
plt.show()
plot_vertical_clustered()We can already see that MQ had the highest number of
cancelled flights (by a relatively wide margin). To better understand
the plot, I will make it horizontal and order it by number of cancelled
flights.
flights_changed_2007 = flights_changed_2007.sort_values(by='Flights', ascending=False)
def plot_horizontal_clustered():
#Sort by descending order of number of flights
plt.figure(figsize=[20,14])
sns.barplot(data = flights_changed_2007, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("Cancelled & Diverted Flights From Each Carrier In 2007")
plt.show()
plot_horizontal_clustered()Its easier to see that the airline companies with the most number of
cancelled flight is American Eagle Airlines
Inc followed by American Ailines and for
diverted flights we have American Airlines
and Southwest Airlines Co.. So we can at least judge
that the carriers at the bottom of the chart have a good record of
flight data (though I am aware that the chart could be like that because
those companies at the bottom do not have as many flights as those at
the top).
Now I’m going to plot the chart for the most recent 6 years from the
dataset, to see if this trend is the same across the years,
i.e, to see if the same companies are always on top.
flights_changed_2003 = get_changed_flights('2003')
flights_changed_2004 = get_changed_flights('2004')
flights_changed_2005 = get_changed_flights('2005')
flights_changed_2006 = get_changed_flights('2006')
flights_changed_2007 = get_changed_flights('2007')
flights_changed_2008 = get_changed_flights('2008') flights_changed_2003 =pd.DataFrame(r.flights_changed_2003).sort_values(by='Flights', ascending=False)
flights_changed_2004 =pd.DataFrame(r.flights_changed_2004).sort_values(by='Flights', ascending=False)
flights_changed_2005 =pd.DataFrame(r.flights_changed_2005).sort_values(by='Flights', ascending=False)
flights_changed_2006 =pd.DataFrame(r.flights_changed_2006).sort_values(by='Flights', ascending=False)
flights_changed_2007 =pd.DataFrame(r.flights_changed_2007).sort_values(by='Flights', ascending=False)
flights_changed_2008 =pd.DataFrame(r.flights_changed_2008).sort_values(by='Flights', ascending=False)flight_changed = [flights_changed_2003, flights_changed_2004, flights_changed_2005, flights_changed_2006, flights_changed_2007, flights_changed_2008]
flight_changed_str = ['flights_changed_2003', 'flights_changed_2004', 'flights_changed_2005', 'flights_changed_2006', 'flights_changed_2007', 'flights_changed_2008']
def save_to_csv(df_list):
for idx, df in enumerate(df_list):
os.makedirs('airline/saved_data/', exist_ok=True)
df.to_csv(f'airline/saved_data/{flight_changed_str[idx]}.csv', index=False)
# save_to_csv(flight_changed)import PyQt5
def carrier_multi():
plt.figure(figsize=(22,22))
base_color = sns.color_palette()[2]
plt.suptitle("2003 to 2006")
plt.subplot(3, 2, 1)
sns.barplot(data = flights_changed_2003, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2003")
plt.subplot(3, 2, 2)
sns.barplot(data = flights_changed_2004, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2004")
plt.subplot(3, 2, 3)
sns.barplot(data = flights_changed_2005, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2005")
plt.subplot(3, 2, 4)
sns.barplot(data = flights_changed_2006, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2006")
plt.subplot(3, 2, 5)
sns.barplot(data = flights_changed_2007, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2007")
plt.subplot(3, 2, 6)
sns.barplot(data = flights_changed_2008, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2008")
# Hiding the axis ticks and tick labels of the bigger plot
plt.tick_params(labelcolor="none", bottom=False, left=False)
# Adding the x-axis and y-axis labels for the bigger plot
plt.xlabel('Common X-Axis', fontsize=15, fontweight='bold')
plt.ylabel('Common Y-Axis', fontsize=15, fontweight='bold')
plt.show()
carrier_multi()For the above chart, I changed the y-labels back to the short form so that it can fit. We can see that there are companies that consistently appear among the top 5 for cancelled flights, American Eagle Airlines and American Airlines for example.
Next I want to analyze the number of delays from a general perspective. With the level of advancements in technology, I expect to see a relative reduction in the percentage of delayed flights each year.
The following query takes over 20 minutes to run so I saved the data in a CSV file after running it, so that I don’t have to rerun it unless I change something.
get_all_flights <- function() {
airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
s <- "Year,
CASE
WHEN Diverted = 1 AND Cancelled = 0 THEN 'Diverted'
WHEN Diverted = 0 AND Cancelled = 1 THEN 'Cancelled'
ELSE 'Arrived'
END AS FlightStatus,
COUNT(*) AS Flights"
query <- sprintf("
SELECT %s FROM table_1987 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1988 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1989 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1990 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1991 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1992 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1993 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1994 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1995 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1996 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1997 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1998 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1999 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2000 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2001 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2002 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2003 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2004 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2005 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2006 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2007 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2008 GROUP BY Year, FlightStatus",
s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s)
all_flights <- dbGetQuery(airline_db, query) # Runs the query and stores it in a dataframe, in memory♂
dbDisconnect(airline_db) # Disconnect from the database when done
return(all_flights)
}
# all_flights <- get_all_flights()
# write.csv(all_flights,"airline/saved_data/all_flights.csv", row.names = FALSE)all_flights <- read.csv('airline/saved_data/all_flights.csv')
head(all_flights)## Year FlightStatus Flights
## 1 1987 Arrived 1288326
## 2 1987 Cancelled 19685
## 3 1987 Diverted 3815
## 4 1988 Arrived 5137497
## 5 1988 Cancelled 50163
## 6 1988 Diverted 14436
def not_arrived():
not_arrived = pd.DataFrame(r.all_flights).query("FlightStatus != 'Arrived'")
# Only flights that did not arrive
bins = np.arange(1987, 2008+1)
plt.figure(figsize=[18,12])
base_color = sns.color_palette()[2]
sns.lineplot(data=not_arrived, x='Year', y='Flights', hue='FlightStatus')
plt.xticks(bins)
plt.show();
not_arrived()From the above, we can see that there a relatively rapid increase in cancelled flights across the years getting to a peak in 2001 followed by a drastic drop the following year and the the increase continues. One major world event around the time of that peak is the dot-com bubble which could have had easily had a major effect on flights around 2000 to 2002. The change in the diverted flights is less drastic.
Some of the interesting discoveries I made include the fact that there was a very large number of flights that were not delayed but instead left earlier than scheduled. Even if this were to happen, I wouldn’t expect it to be more than a couple of minutes since anyone that is boarding a flight is expected to be at the airport a few minutes before departure time. But I found that there are flights that departed as early as 2 hours and some even up to 5 hours before scheduled.